Assignment 1

1

In our first plot, linolenic is a continuous variable mapped to shades of blue. The channel capacity for hue is only about 10 levels, and occlusion among overlapping points further hampers our reading of the plot. In the second plot, linolenic is segmented into 4 groups, which gives a quicker understanding of the data by showing the relative values of the groups. Relative judgement remains a perception problem, as color hue comes with the highest judgement error for human observers.

ol = read.csv("olive.csv", header = T, row.names = 1)
head(ol, 10)
##    Region         Area palmitic palmitoleic stearic oleic linoleic
## 1       1 North-Apulia     1075          75     226  7823      672
## 2       1 North-Apulia     1088          73     224  7709      781
## 3       1 North-Apulia      911          54     246  8113      549
## 4       1 North-Apulia      966          57     240  7952      619
## 5       1 North-Apulia     1051          67     259  7771      672
## 6       1 North-Apulia      911          49     268  7924      678
## 7       1 North-Apulia      922          66     264  7990      618
## 8       1 North-Apulia     1100          61     235  7728      734
## 9       1 North-Apulia     1082          60     239  7745      709
## 10      1 North-Apulia     1037          55     213  7944      633
##    linolenic arachidic eicosenoic
## 1         36        60         29
## 2         31        61         29
## 3         31        63         29
## 4         50        78         35
## 5         50        80         46
## 6         51        70         44
## 7         49        56         29
## 8         39        64         35
## 9         46        83         33
## 10        26        52         30
ggplot(ol, aes(palmitic, oleic, col = linolenic)) + geom_point()

ol$disc = cut_interval(ol$linolenic, 4)
ggplot(ol, aes(palmitic, oleic, col = disc)) + geom_point()
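
To make the discretisation step concrete, here is a minimal sketch of what cut_interval() returns. The vector x is made up for illustration and is not taken from olive.csv; the point is only that the result is a factor with n equal-width bins, which ggplot2 then treats as a discrete color scale.

```r
library(ggplot2)

# Hypothetical values standing in for ol$linolenic
x <- c(0, 10, 20, 30, 40, 50, 60, 70, 80)

# cut_interval() splits the range of x into n bins of equal width
bins <- cut_interval(x, n = 4)

nlevels(bins)   # number of groups
table(bins)     # how many observations fall in each group
```

Because the bins have equal width rather than equal counts, some groups can hold far more points than others when the variable is skewed.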

2

The color plot, with linolenic segmented into 4 groups, is the easiest to analyse. The size mapping suffers from occlusion because the points overlap. The orientation angle mapping does not help either, as the many observations in the scatter plot lead to a high relative judgement error.

a) Color hue: 10 levels can be perceived and 3.1 bits can be decoded. Color brightness: 5 levels can be perceived and 2.3 bits can be decoded.

b) Size of object: 4-5 levels can be perceived, depending on the individual abilities of the human subject, and 2.2 bits can be decoded.

c) Line orientation: 3 bits can be decoded.

ggplot(ol, aes(palmitic, oleic, col = disc)) + geom_point()

ggplot(ol, aes(palmitic, oleic, size = disc)) + geom_point()

# Recode the 4 groups as the angles 0, pi/4, pi/2 and 3*pi/4
levels(ol$disc) <- (0:3) * (pi / 4)
ol$disc <- as.numeric(as.character(ol$disc))
ggplot(ol, aes(palmitic, oleic)) + geom_point() + 
  geom_spoke(angle = ol$disc, radius = 40) +
  ggtitle("Scatter plot of palmitic vs oleic discretized\nby linolenic orientation angle")

3

In the first plot, Region is treated as numeric and mapped to color brightness. This suggests that the regions are ordered and interrelated, but no such relationship exists, as Region is a categorical variable. Treisman’s theory of preattentive processing is showcased in this example: in the second plot we see the same structure much more quickly, due to preattentive processing of contrast and luminance, as color hue is mapped to the categorical variable Region.

ggplot(ol, aes(oleic, eicosenoic, col = Region)) + geom_point()

ggplot(ol, aes(oleic, eicosenoic, col = cut_interval(Region,3))) + geom_point()
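
As a small illustration of the numeric versus categorical distinction, the sketch below builds both mappings on a made-up data frame (the values are hypothetical, only the 1/2/3 coding of Region mirrors the olive data). It uses factor() rather than cut_interval(); that is a substitution for illustration, and both give ggplot2 a discrete variable.

```r
library(ggplot2)

# Hypothetical data; Region is coded 1, 2, 3 as in the olive data
d <- data.frame(
  oleic      = c(7800, 7700, 7300, 7200, 7900, 7850),
  eicosenoic = c(30, 35, 2, 1, 3, 2),
  Region     = c(1, 1, 2, 2, 3, 3)
)

# Numeric Region: continuous gradient, which implies an ordering
p_numeric <- ggplot(d, aes(oleic, eicosenoic, col = Region)) + geom_point()

# factor(Region): one distinct hue per category, no implied ordering
d$Region_f <- factor(d$Region)
p_factor <- ggplot(d, aes(oleic, eicosenoic, col = Region_f)) + geom_point()
```

Printing p_numeric and p_factor side by side should show a gradient legend in the first case and three separate hues in the second.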

4

Three variables are mapped to color, shape and size, each with 3 levels, and these feature maps are processed in parallel in our brain. This creates a problem, as we have to analyse 27 different types of observation. Human channel capacity is limited to 10 levels of hue (3.1 bits), 5 levels of brightness (2.3 bits) and 4 to 5 levels of size (2.2 bits). On average, we are also limited to 6-7 distinguishable levels (2.6 bits). When several mappings are used at once, the channel capacity does not increase linearly as the sum of the individual channel capacities. With size, brightness and hue used together, the channel capacity is 4.1 bits, whereas the sum of the individual channel capacities is 7.6 bits. Because of this, we cannot interpret the plot easily with preattentive processing.

ggplot(ol, aes(oleic, eicosenoic,
               col = cut_interval(linoleic, 3),
               shape = cut_interval(palmitic, 3), 
               size = cut_interval(palmitoleic, 3))) + 
  geom_point()
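
The arithmetic behind the channel capacity argument can be checked in a few lines. This is only a worked example using the figures quoted in this answer; treating the quoted 4.1 bits as a yardstick for this particular plot (which uses shape instead of brightness) is an assumption.

```r
# Channel capacities quoted in the text, in bits
capacity <- c(hue = 3.1, brightness = 2.3, size = 2.2)

# Naive expectation: capacities add up when channels are combined
sum_bits <- sum(capacity)            # 7.6 bits

# Capacity quoted for the three channels used together
combined_bits <- 4.1

# Bits translate into distinguishable levels as 2^bits
levels_combined <- 2^combined_bits   # roughly 17 distinguishable combinations

# The plot asks the reader to tell apart 3 x 3 x 3 groups
groups_in_plot <- 3^3                # 27
bits_needed <- log2(groups_in_plot)  # about 4.75 bits

bits_needed > combined_bits          # TRUE: the plot exceeds the quoted capacity
```

So even though the summed capacities (7.6 bits) would comfortably cover 27 groups, the combined capacity of 4.1 bits does not.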

5

Size, contrast and shape are individual feature maps that are processed separately from color, and hence preattentive processing helps in this case. We can see clear decision boundaries between the different regions. According to Treisman’s theory, the human visual system splits different features into separate maps and processes them in parallel. This enables the system to ignore non-target information contained in the master map. This is seen here, as the Region variable has clear decision boundaries that can be observed immediately from the respective color. This makes it easy to identify the clusters based on Region in spite of the many other features in the plot.

ggplot(ol, aes(oleic, eicosenoic, col = Region,
               shape = cut_interval(palmitic, 3), 
               size = cut_interval(palmitoleic, 3))) + geom_point()

6

The relative judgement error is very high because the pie chart encodes the values as areas: the dominant group, South-Apulia, looks much larger than the other groups.

p <- plot_ly(ol, labels = ~Area, type = 'pie', showlegend = TRUE, textinfo="text", text="") %>%
  layout(title = 'Pie Chart Area')
p

7

It is harder to look for outliers in the contour plot than in the scatter plot, because the extreme values are not plotted in the contour plot. It is also harder to make out clusters in the contour plot than in the scatter plot, which is a big issue in this plot.

The contour plot shows 5 peaks, but you won't be able to spot any corresponding difference in the scatter plot. At some of the peaks shown by the contour plot there are no points in the scatter plot, which can be very misleading. For example, for the contour peak at approximately (900, 12) there are no corresponding points in the scatter plot. So contour plots can sometimes be misleading.

ggplot(ol, aes(linoleic, eicosenoic)) + geom_density2d()

ggplot(ol, aes(linoleic, eicosenoic)) + geom_point()
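
One way to check whether a contour peak is backed by actual observations is to draw both layers in the same panel. The sketch below does this on synthetic two-cluster data (the numbers are invented and only loosely resemble the linoleic and eicosenoic ranges); it is not part of the original assignment code.

```r
library(ggplot2)

set.seed(1)
# Synthetic two-cluster data standing in for linoleic / eicosenoic
d <- data.frame(
  x = c(rnorm(100, mean = 700, sd = 40), rnorm(100, mean = 1200, sd = 60)),
  y = c(rnorm(100, mean = 25, sd = 5),   rnorm(100, mean = 2,  sd = 1))
)

# Points show the raw observations and outliers; contours show estimated density
p <- ggplot(d, aes(x, y)) +
  geom_point(alpha = 0.4) +
  geom_density2d()
```

Printing p shows directly which contour peaks sit on top of points and which arise only from the smoothing of the density estimate.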

Assignment 2

1 - Read Data

The columns vary a lot in range. Some variables, like BAvg, are averages and lie in a range of 0.235 to 0.282, while variables like TB lie in the range 2090 to 2615. This is why scaling is required before we apply non-metric MDS. Scaling brings all the variables into a comparable range, so that the distances NMDS works from are not dominated by the variables with the largest values.

##                     League Won Lost Runs.per.game HR.per.game   AB Runs
## Aizona Diamondbacks     NL  69   93          4.64    1.172840 5665  752
## Atlanta Braves          NL  68   93          4.03    0.757764 5514  649
## Baltimore Orioles       AL  89   73          4.59    1.561728 5524  744
## Boston Red Sox          AL  93   69          5.42    1.283951 5670  878
## Chicago Cubs            NL 103   58          4.99    1.236025 5503  808
## Chicago White Sox       AL  78   84          4.23    1.037037 5550  686
##                     Hits X2B X3B  HR RBI StolenB CaughtS  BB   SO  BAvg
## Aizona Diamondbacks 1479 285  56 190 709     137      31 463 1427 0.261
## Atlanta Braves      1404 295  27 122 615      75      34 502 1240 0.255
## Baltimore Orioles   1413 265   6 253 710      19      13 468 1324 0.256
## Boston Red Sox      1598 343  25 208 836      83      24 558 1160 0.282
## Chicago Cubs        1409 293  30 199 767      66      34 656 1339 0.256
## Chicago White Sox   1428 277  33 168 656      77      36 455 1285 0.257
##                       OBP   SLG   OPS   TB GDP HBP SH SF IBB  LOB
## Aizona Diamondbacks 0.320 0.432 0.752 2446 117  50 43 38  43 1113
## Atlanta Braves      0.321 0.384 0.705 2119 145  59 64 52  60 1161
## Baltimore Orioles   0.317 0.443 0.760 2449 119  44 17 36  19 1065
## Boston Red Sox      0.348 0.461 0.810 2615 137  43  8 40  34 1162
## Chicago Cubs        0.343 0.429 0.772 2359 107  96 42 37  45 1217
## Chicago White Sox   0.317 0.410 0.727 2275 122  53 29 44  16 1105
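
The effect of scaling on the distances can be seen with just two of the columns. The sketch below uses the BAvg and TB values of the first four teams printed above; the short row labels are shorthand introduced here, and scale() with its default centring and standard-deviation scaling is one possible choice, not necessarily the one used for the results in this report.

```r
# BAvg and TB for the first four teams shown above (row labels are shorthand)
stats <- data.frame(
  BAvg = c(0.261, 0.255, 0.256, 0.282),
  TB   = c(2446, 2119, 2449, 2615),
  row.names = c("ARI", "ATL", "BAL", "BOS")
)

# Unscaled: the Euclidean distance is essentially the TB difference
d_raw <- dist(stats)

# scale() centres each column and divides it by its standard deviation
stats_scaled <- scale(stats)
d_scaled <- dist(stats_scaled)

apply(stats_scaled, 2, sd)  # both columns now have standard deviation 1
```

In d_raw the distance between ARI and ATL is about 327, which is just their TB difference; after scaling, BAvg contributes on an equal footing.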

2 - Non-metric MDS

It is hard to see a difference between the leagues in this plot. We could say that the National League (NL) teams are spread out away from the origin, and the American League (AL) teams are more centered around the origin.

The points are well spread out, so it is hard to tell whether one MDS component provides better differentiation between the leagues. In my opinion, V1 splits the leagues better than V2.

According to this plot “Boston Red Sox” and “Atlanta Braves” look like outliers.

## initial  value 19.856833 
## iter   5 value 16.319153
## iter  10 value 16.046215
## final  value 15.935476 
## converged
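
For readers who want to reproduce this kind of output without the baseball spreadsheet, here is a minimal sketch of the same pipeline on the built-in swiss dataset. The dataset and the explicit scale() call are assumptions made for illustration; the iteration log above comes from the baseball data.

```r
library(MASS)

# swiss is a built-in dataset used here only as a stand-in for the baseball data
x <- scale(swiss)          # put all columns on a comparable scale
d <- dist(x)               # Euclidean distances between observations

# Non-metric MDS into k = 2 dimensions; trace = FALSE silences the iteration log
fit <- isoMDS(d, k = 2, trace = FALSE)

coords <- as.data.frame(fit$points)  # columns V1 and V2
fit$stress                           # final stress, reported in percent
```

With trace = TRUE (the default), isoMDS() prints the initial, intermediate and final stress values in the same format as the log shown above.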

3 - Shepard Plot

MDS was able to bring the stress value down to 15.6%. Given that the dataset has 26 dimensions and is reduced to 2 dimensions, a stress level of 15.6% is good.

Some of the observation pairs that were hard for MDS to map were -

“Oakland Athletics and Milwaukee Brewers”, “NY Mets and Minnesota Twins”, “Minnesota Twins and Arizona Diamondbacks”, “Oakland Athletics and Chicago Cubs”, “Pittsburgh Pirates and Chicago Cubs”
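
The pairs above were read off the interactive Shepard plot. A programmatic alternative, sketched below on the built-in swiss data as a stand-in, ranks pairs by how far they fall from a monotone fit of the configuration distances on the original dissimilarities. Using isoreg() residuals as the measure of "hard to map" is an assumption of this sketch, not the method used in the report.

```r
library(MASS)

# Built-in swiss data as a stand-in for the baseball data
d <- dist(scale(swiss))
fit <- isoMDS(d, k = 2, trace = FALSE)

delta <- as.numeric(d)                 # original dissimilarities
D     <- as.numeric(dist(fit$points))  # distances in the 2-D configuration

# Row/column index of every pair, in the same order that dist() stores them
n   <- nrow(fit$points)
idx <- which(lower.tri(matrix(0, n, n)), arr.ind = TRUE)

# Monotone (isotonic) fit of D on delta, as in a Shepard diagram;
# large residuals mark pairs that the configuration places poorly
iso <- isoreg(delta, D)
res <- abs(residuals(iso))

worst <- order(res, decreasing = TRUE)[1:5]
hard_pairs <- data.frame(
  obj1 = rownames(swiss)[idx[worst, "row"]],
  obj2 = rownames(swiss)[idx[worst, "col"]],
  residual = res[worst]
)
hard_pairs
```

The same idea should carry over to the baseball data by replacing swiss with the numeric columns of bball.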

4 -

Since V1 was splitting the leagues better, I plotted all the variables against it and found that “RBI” and “OPS” had a strong negative relationship with V1. Searching for these online, I found that both are important statistics in baseball for differentiating and ranking teams.

RBI (Runs batted in) - RBI is a statistic in baseball that credits a batter for making a play that allows a run to be scored. The top teams in the league have a high RBI. It is an important batting statistic in baseball.

OPS (On-base plus slugging) - OPS is a statistic calculated as the sum of a player's ability to get on base and to hit with power (OBP + SLG). Usually the league leaders have the players with the highest OPS. This is an important factor that increases the runs scored by a team.

Both of these RBI and OPS are important batting statistics in baseball.

plots_bball$P12 #RBI  **

plots_bball$P20 #OPS  **
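
Instead of inspecting every scatter plot by eye, the relationship between each variable and V1 can also be summarised with a correlation coefficient. The sketch below does this on the built-in swiss data as a stand-in; using Pearson correlation as the measure of "connection" is an assumption, and the sign of an MDS axis is arbitrary, so only the magnitude should be read as meaningful.

```r
library(MASS)

# Built-in swiss data as a stand-in for the baseball data
x <- scale(swiss)
fit <- isoMDS(dist(x), k = 2, trace = FALSE)
V1 <- fit$points[, 1]

# Pearson correlation of each original variable with the first MDS coordinate
cors <- cor(x, V1)[, 1]

# Variables ordered by strength of the relationship, strongest first
cors[order(abs(cors), decreasing = TRUE)]
```

Variables at the top of this list are the natural candidates for a closer look in the individual plots.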

Appendix

library(ggplot2)
library(plotly)
library(xlsx)
library(MASS)
library(gridExtra)
#Assignment 1
#Q1
ol = read.csv("olive.csv", header = T, row.names = 1)
head(ol, 10)

ggplot(ol, aes(palmitic, oleic, col = linolenic)) + geom_point()

disc = cut_interval(ol$linolenic, 4)
ggplot(ol, aes(palmitic, oleic, col = disc)) + geom_point()


#Q2
ggplot(ol, aes(palmitic, oleic, col = disc)) + geom_point()
ggplot(ol, aes(palmitic, oleic, size = disc)) + geom_point()
# Map the 4 groups to the angles 0, pi/4, pi/2 and 3*pi/4
angle = (as.integer(disc) - 1) * (pi / 4)
ggplot(ol, aes(palmitic, oleic)) + geom_point() + 
    geom_spoke(angle = angle, radius = 40)

#Q3
ggplot(ol, aes(oleic, eicosenoic, col = Region)) + geom_point()


#Q4
ggplot(ol, aes(ol$oleic, ol$eicosenoic, col = cut_interval(ol$linoleic, 3),
              shape = cut_interval(ol$palmitic, 3), 
              size = cut_interval(ol$palmitoleic, 3))) + geom_point()


#Q5
ggplot(ol, aes(ol$oleic, ol$eicosenoic, col = ol$Region,
              shape = cut_interval(ol$palmitic, 3), 
              size = cut_interval(ol$palmitoleic, 3))) + geom_point()

#Q6
p <- plot_ly(ol, labels = ~Area, type = 'pie', showlegend = FALSE) %>%
  layout(title = 'Pie Chart Area')
p

#Q7
ggplot(ol, aes(ol$linoleic, ol$eicosenoic)) + geom_density2d()
ggplot(ol, aes(ol$linoleic, ol$eicosenoic)) + geom_point()


#Assignment 2
#Q1

bball = read.xlsx("baseball-2016.xlsx", sheetName = "Sheet1", header = TRUE,
                  row.names = 1)
head(bball)

#Q2
bball.numeric = bball[,3:27]
distance = dist(bball.numeric)
res = isoMDS(distance, k=2, p=2)
coords = res$points

coordsMDS = as.data.frame(coords)
coordsMDS$name = rownames(coordsMDS)
coordsMDS$league = bball$League
plot_ly(coordsMDS, x=~V1, y=~V2, type="scatter", mode = "markers"
        , hovertext=~name, color= ~league)


#Q3
sh <- Shepard(distance, coords)
delta <-as.numeric(distance)
D<- as.numeric(dist(coords))

n=nrow(coords)
index=matrix(1:n, nrow=n, ncol=n)
index1=as.numeric(index[lower.tri(index)])

n=nrow(coords)
index=matrix(1:n, nrow=n, ncol=n, byrow = T)
index2=as.numeric(index[lower.tri(index)])

plot_ly()%>%
  add_markers(x=~delta, y=~D, hoverinfo = 'text',
              text = ~paste('Obj1: ', rownames(bball)[index1],
                            '<br> Obj 2: ', rownames(bball)[index2]))%>%
  add_lines(x=~sh$x, y=~sh$yf)

#Q4
bball$V1 = coordsMDS$V1
bball$V2 = coordsMDS$V2

cols_bball = colnames(bball)
dim(bball)[2]
plots_bball = list()
for(i in 2:27){
  pl_name = paste("P", i, sep = '')
  col_name = cols_bball[i]
  plots_bball[[pl_name]] = ggplot(bball, aes_string("V1", col_name)) + 
    geom_point() + geom_line(aes(x=0))
}
grid.arrange(grobs = plots_bball, ncol = 6, nrow = 6)

plots_bball$P2
plots_bball$P3
plots_bball$P4 #Runs per game
plots_bball$P5
plots_bball$P6 #AB
plots_bball$P7 #Runs **
plots_bball$P8 #Hits **
plots_bball$P9
plots_bball$P10
plots_bball$P11
plots_bball$P12 #RBI **
plots_bball$P13
plots_bball$P14
plots_bball$P15
plots_bball$P16
plots_bball$P17 #BAvg
plots_bball$P18
plots_bball$P19 #SLG
plots_bball$P20 #OPS  **
plots_bball$P21 #TB  **
plots_bball$P22
plots_bball$P23
plots_bball$P24
plots_bball$P25
plots_bball$P26
plots_bball$P27

plots_bball2 = list()
for(i in 2:27){
  pl_name = paste("P", i, sep = '')
  col_name = cols_bball[i]
  plots_bball2[[pl_name]] = ggplot(bball, aes_string("V2", col_name)) + 
    geom_point() + geom_line(aes(x=0))
}
grid.arrange(grobs = plots_bball2, ncol = 6, nrow = 6)
plots_bball2$P2
plots_bball2$P3
plots_bball2$P4
plots_bball2$P5
plots_bball2$P6
plots_bball2$P7
plots_bball2$P8
plots_bball2$P9
plots_bball2$P10
plots_bball2$P11
plots_bball2$P12
plots_bball2$P13
plots_bball2$P14
plots_bball2$P15
plots_bball2$P16
plots_bball2$P17
plots_bball2$P18
plots_bball2$P19
plots_bball2$P20
plots_bball2$P21
plots_bball2$P22
plots_bball2$P23
plots_bball2$P24
plots_bball2$P25
plots_bball2$P26
plots_bball2$P27